start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch … #18229
…and should import it before use Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@yao-matrix @liangan1 please review
The documentation is not available anymore as the PR was closed or merged.
Thanks for your PR. Note that it needs to be documented if you want users to be able to use this integration properly.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sgugger document has been uploaded
Hi @sgugger, this fix is aligned with what we do in the accelerate PR. Without the correct module import, DDP cannot work with the CCL backend.
Thanks for drafting some documentation. I know the code is the same as for Accelerate but I have a couple of comments on the doc before we can merge this.
For PyTorch-1.10:

```
pip install oneccl_bind_pt==1.10.0 -f https://software.intel.com/ipex-whl-stable
```

For PyTorch-1.11:

```
pip install oneccl_bind_pt==1.11.0 -f https://software.intel.com/ipex-whl-stable
```

For PyTorch-1.12:

```
pip install oneccl_bind_pt==1.12.0 -f https://software.intel.com/ipex-whl-stable
```
It doesn't seem likely that we will remember to add each new PyTorch version, so maybe just say
```bash
pip install oneccl_bind_pt=={pytorch_version} -f https://software.intel.com/ipex-whl-stable
```
where `{pytorch_version}` should be your PyTorch version, for instance 1.12.0
and add a comment if the micro should always stay at 0 and/or a link to the list of supported versions you have.
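One way to read this suggestion is that the pinned `oneccl_bind_pt` version is derived from the installed PyTorch version rather than listed per release. The following is a minimal Python sketch of that idea. It assumes the wheel version matches the public PyTorch version string exactly (whether the micro version always stays at 0 is the open question raised above), and `build_install_command` is a hypothetical helper, not part of any library.

```python
# Wheel index URL as it appears in the documentation under review.
WHEEL_INDEX = "https://software.intel.com/ipex-whl-stable"


def build_install_command(pytorch_version: str) -> str:
    """Return the pip command for the oneccl_bind_pt wheel matching a PyTorch version.

    Assumption: the wheel version equals the public PyTorch version, so only a
    local build suffix such as "+cpu" is dropped and nothing else is changed.
    """
    public_version = pytorch_version.split("+", 1)[0]
    return f"pip install oneccl_bind_pt=={public_version} -f {WHEEL_INDEX}"


print(build_install_command("1.12.0+cpu"))
```

In practice the argument would typically be `torch.__version__`; the list of versions that actually have a published wheel still has to come from Intel's documentation.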

# Efficient Training on Multiple CPUs

When training on a single CPU is too slow, we will use multiple CPUs, This guide focuses on PyTorch-based DDP enabling and how to do it efficiently.
When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling distributed CPU training efficiently.

## Intel® oneCCL Bindings for PyTorch

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training that implements collectives such as allreduce, allgather, and alltoall. For more information on oneCCL, please refer to the oneCCL documentation and oneCCL specification.
You should add links for "oneCCL documentation" and "oneCCL specification" here.
The oneccl_bindings_for_pytorch module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux.
Is it `oneccl_bind_pt` or `oneccl_bindings_for_pytorch`? Also what is a "ProcessGroup API"? Not sure this sentence adds anything to the doc.
Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl).

### Usage in Trainer
To enable DDP in Trainer with ccl backend, users should add **`--xpu_backend ccl`** in training command arguments.
To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--xpu_backend ccl`** in the command arguments.

Take an example of the use cases on [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)

following command enables **2DDP** in one Xeon node, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
The following command enables training with 2 processes on one Xeon node, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
2DDP won't mean anything to the user.
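Since the text leaves the tuning of `OMP_NUM_THREADS` and `CCL_WORKER_COUNT` open, a starting point could be computed as in the sketch below. The heuristic is an assumption and is not stated anywhere in this PR: each process owns one socket, and the cores of that socket are split between oneCCL worker threads and OpenMP compute threads. `suggest_thread_env` is a hypothetical helper.

```python
from typing import Dict


def suggest_thread_env(cores_per_socket: int, ccl_worker_count: int = 1) -> Dict[str, str]:
    """Suggest starting values for the two environment variables of one process.

    Assumption (not from the PR): one process per socket, and every oneCCL
    worker thread takes one core away from the OpenMP thread pool.
    """
    if ccl_worker_count < 0 or ccl_worker_count >= cores_per_socket:
        raise ValueError("ccl_worker_count must be in [0, cores_per_socket)")
    return {
        "CCL_WORKER_COUNT": str(ccl_worker_count),
        "OMP_NUM_THREADS": str(cores_per_socket - ccl_worker_count),
    }


# Example: a socket with 24 physical cores and one oneCCL worker thread.
print(suggest_thread_env(24, 1))
```

The returned values are only a first guess; the best split depends on the model and the interconnect, so it is worth measuring a few settings.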
```
--no_cuda \
--xpu_backend ccl
```
following command enables **4DDP** in two Xeons (node0 and node1, taking node0 as the master), ppn(processes per node) is set to 2, with one process running per one socket, OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
in node0, you need to create a config file which contains ip of each node(for ex: hostfile) and pass to mpirun as a argument
In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument.
```
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
```
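The hostfile described above is one IP address per line, with node0 first. As a rough illustration, it could be generated with a few lines of Python. The helper names `format_hostfile` and `write_hostfile` are hypothetical, and the addresses in the example are placeholders from the documentation range 192.0.2.0/24, not real nodes.

```python
import ipaddress
from pathlib import Path
from typing import Iterable


def format_hostfile(node_ips: Iterable[str]) -> str:
    """Return hostfile content: one IP address per line, main node (node0) first."""
    lines = []
    for ip in node_ips:
        ipaddress.ip_address(ip)  # raises ValueError if the string is not a valid IP
        lines.append(ip)
    return "".join(f"{ip}\n" for ip in lines)


def write_hostfile(path: str, node_ips: Iterable[str]) -> None:
    """Write the hostfile to `path` so its path can be passed to the launcher."""
    Path(path).write_text(format_hostfile(node_ips))


# Placeholder addresses for node0 and node1.
print(format_hostfile(["192.0.2.1", "192.0.2.2"]), end="")
```

Validating each entry before writing catches typos early, since a malformed hostfile usually only fails once the launcher tries to reach the remote node.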
run the following command in node0 and **4DDP** will be enabled in node0 and node1
Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1:

### Intel® oneCCL Bindings for PyTorch installation:

Wheel files are avaiable for the following Python versions:
Wheel files are available for the following Python versions:
@sgugger thanks for the careful review. The doc is updated based on your comments.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Thanks for iterating on this!
huggingface#18229)
* start from 1.12, torch_ccl is renamed as oneccl_bindings_for_pytorch and should import it before use
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add doc for perf_train_cpu_many
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update doc
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
What does this PR do?
When running Transformers with torch 1.12, the oneCCL bindings (version 1.12) must be installed as well to enable DDP fine-tuning on CPU:
python -m pip install oneccl_bind_pt==1.12.0 -f https://developer.intel.com/ipex-whl-stable
Starting with 1.12.0, the module is renamed to oneccl_bindings_for_pytorch and must be imported before use; otherwise an error occurs.
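The rename described here can be handled by choosing the module name from the PyTorch version before importing it. The sketch below is only an illustration of that idea and is not the code merged in this PR. `ccl_module_name` and `import_ccl_bindings` are hypothetical helpers, and the 1.12 cutoff is taken from the description above.

```python
import importlib
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as "1.12.0+cpu"."""
    major, minor = version.split("+", 1)[0].split(".")[:2]
    return int(major), int(minor)


def ccl_module_name(torch_version: str) -> str:
    """Pick the oneCCL bindings module name for a given PyTorch version.

    Per the PR description: "torch_ccl" before 1.12,
    "oneccl_bindings_for_pytorch" from 1.12 on.
    """
    if parse_major_minor(torch_version) >= (1, 12):
        return "oneccl_bindings_for_pytorch"
    return "torch_ccl"


def import_ccl_bindings(torch_version: str):
    """Import the bindings before the ccl backend is used.

    Raises ImportError if the matching package is not installed.
    """
    return importlib.import_module(ccl_module_name(torch_version))


print(ccl_module_name("1.12.0"))
print(ccl_module_name("1.11.0+cpu"))
```

A caller would pass `torch.__version__` and call `import_ccl_bindings` once, before initializing the distributed process group with the ccl backend.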
Fixes # (issue)
as described above.